We’re excited to announce the release of the newest version of the code that generates the database used in “Is There a Replication Crisis in Finance?” (Jensen, Kelly, and Pedersen, Journal of Finance 2023), as well as the factor portfolio returns.
Highlights of this release:
Migration to Python: We’ve switched from SAS and R to Python.
Modular, faster codebase: The code is now more modular and leverages Polars and DuckDB for performance.
Single-command execution: The characteristics database and factor returns can be computed in a single Slurm job or a single Python command.
Substantial speed-ups: The full database and factor returns can be generated in ~6 hours.
Modern output formats: Outputs have moved from CSV to Parquet.
For details on factor definitions and construction, see: Global Factor Data.
Naturally, changing the programming language led to some differences in the underlying algorithms relative to SAS. Nevertheless, most functions and the overall structure of the SAS/R codebase have been retained. The existing SAS/R documentation remains applicable and, in many cases, is more extensive than the current Python documentation.
Output structure
All output is contained in:
SAS-Python-Migrate/build_database/data/processed/
The output is organized as follows:
processed/
accounting_data/
quarterly.parquet and annual.parquet contain stock-level accounting characteristics from Compustat data sourced from quarterly and annual filings, respectively.
characteristics/
world_data_unfiltered.parquet contains the monthly stock file with stock-level characteristics, without filters.
world_data_filtered.parquet contains the monthly stock file with stock-level characteristics filtered by primary_sec == 1 (primary security flag), common == 1 (common stock flag), obs_main == 1 (main observation flag), exch_main == 1 (main exchange flag).
Partitions of world_data_filtered.parquet by country (e.g., GBR.parquet, USA.parquet). Files are named using ISO Alpha-3 country codes.
other_output/
ap_factors_monthly.parquet and ap_factors_daily.parquet contain the returns of FF3 and HXZ4 factor portfolios for every country at monthly and daily frequencies, respectively.
market_returns.parquet and market_returns_daily.parquet contain market-portfolio returns for every country at monthly and daily frequencies, respectively.
nyse_cutoffs.parquet contains NYSE market-capitalization quantiles used for portfolio construction.
return_cutoffs.parquet and return_cutoffs_daily.parquet contain return quantiles used as winsorization thresholds.
portfolios/
pfs(_daily).parquet contains portfolios formed at monthly (daily) frequency by sorting stocks into three groups based on non-microcap breakpoints. Portfolio 1 (3) includes stocks with the lowest (highest) value of the characteristic.
hml(_daily).parquet contains long–short portfolios formed at monthly (daily) frequency that go long Portfolio 3 (high characteristic values) and short Portfolio 1 (low characteristic values) from pfs(_daily).parquet.
lms(_daily).parquet contains long–short portfolios formed at monthly (daily) frequency following the Jensen, Kelly, and Pedersen (2022) signing convention (e.g., long low-asset-growth, short high-asset-growth).
cmp(_daily).parquet contains rank-weighted (characteristic-managed) portfolios formed at monthly (daily) frequency within mega, large, small, micro, and nano capitalization groups in the U.S.
country_factors(_daily)/ contains country-by-country lms.parquet files for easier use at monthly (daily) frequency. Files are named using ISO Alpha-3 country codes.
regional_factors(_daily)/ contains regional factor portfolios based on lms.parquet at monthly (daily) frequency, constructed following Jensen, Kelly, and Pedersen (2022).
world_dsf.parquet contains daily stock-level returns for all securities in the database.
world_ret_monthly.parquet contains monthly stock-level returns for all securities in the database.
daily_rets_by_country/ contains partitions of world_dsf.parquet by country (e.g., GBR.parquet, USA.parquet). Files are named using ISO Alpha-3 country codes.
The SAS-Python-Migrate/build_database/data/processed/ directory also contains two additional subfolders, raw/ and interim/, used by the code to store temporary data for computing stock-level characteristics.
To demonstrate the quality of the new dataset, we compare the factors produced by the SAS/R code with those from the new Python code. The Python functions replicate the behavior of their SAS counterparts almost exactly; the main intentional change is in standardized_accounting_data, which now handles duplicates more precisely.
US factor comparison
First, we examine correlations between U.S. long–short factor-portfolio returns produced by SAS/R and Python. In this setting, we aim for a correlation of 1 for each factor.
Figure 1: Correlation of US factor portfolios produced by Python and SAS/R.
As shown above, all factors have correlations very close to 1.
Next, we compare the mean returns and standard deviations of the factors. In the scatter plots below, the x-axis shows the statistic for SAS/R and the y-axis shows the corresponding statistic for Python. If the two versions match, the points should lie on a 45-degree line through the origin (slope 1, intercept 0).
Figure 4: Comparison of mean and standard deviation of World Ex US factor portfolios produced by Python and SAS/R.
Overall, the mean and standard deviation of the factor returns are close to those produced in SAS/R.
Comparison of characteristics at stock level
At a more granular level, we can also compare the correlation of the characteristics produced by the SAS portion of the previous version of the code and Python. For that purpose, we compute the Spearman rank correlation of each of the characteristics.
The plot below shows a histogram of the rank correlation:
Code
```python
import polars as pl
import matplotlib.pyplot as plt

corr_df = pl.read_parquet("data/sas_vs_py_summ_stats.parquet")

fig, ax = plt.subplots(figsize=(9, 5))

# Histogram (capture counts and patches)
n, bins, patches = ax.hist(
    corr_df["spearman r-corr"],
    bins=10,
    density=False,
    color="royalblue",
    edgecolor="black",
    linewidth=0.5,
    alpha=0.8,
)

# X limits
ax.set_xlim(0.994, 1.0005)

# Labels & title
ax.set_xlabel(r"$\rho$", fontsize=14)
ax.set_ylabel("n", fontsize=14)
ax.set_title("Distribution of Spearman rank correlations", pad=20, fontsize=16)

# Mean line and label
mean_r = corr_df["spearman r-corr"].mean()
ax.axvline(mean_r, color="red", linestyle="--", linewidth=1.3)

# Put mean text *above* the tallest bar, horizontal
max_count = n.max() if len(n) > 0 else 1
ax.text(
    mean_r,
    max_count * 1.3,  # 30% above tallest bar
    f"Mean = {mean_r:.3f}",
    color="red",
    ha="center",
    va="bottom",
    fontsize=9,
)

# Numbers on top of each bar
for count, patch in zip(n, patches):
    if count <= 0:
        continue  # skip empty bins (log scale hates 0)
    x = patch.get_x() + patch.get_width() / 2
    y = patch.get_height()
    ax.text(x, y * 1.05, f"{int(count)}", ha="center", va="bottom", fontsize=8)

# Light grid on y-axis
ax.grid(axis="y", linestyle="--", alpha=0.4)

# Clean up spines
for spine in ["top", "right"]:
    ax.spines[spine].set_visible(False)

# Log scale on y
ax.set_yscale("log")

fig.tight_layout()
plt.show()
```
Figure 5: Histogram of Spearman rank correlations.
Because the rank correlations are so close to 1, portfolios formed by sorting stocks on these characteristics should be nearly identical to those produced by the SAS code.
The following table shows a more detailed comparison of the distribution of the characteristics produced by SAS/R and Python.
Code
```python
import pandas as pd
from itables import init_notebook_mode, show

# Make DataFrames render as interactive tables
init_notebook_mode(all_interactive=True)

# Read the Parquet file
df = pd.read_parquet("data/sas_vs_py_summ_stats.parquet").sort_values(
    ["spearman r-corr", "characteristic"], ascending=[False, False]
)

# Remove the index (reset and drop)
df = df.reset_index(drop=True)

# Show an interactive, scrollable table
show(
    df,
    scrollY="400px",  # vertical scroll area
    scrollX=True,     # horizontal scroll
    # paging=True,    # enable paging
    # maxBytes=0      # disable downsampling so all rows/cols are shown
)
```
For every characteristic, the table reports Pearson and Spearman correlations between the output of the two pipelines (SAS/R and Python), followed by their respective means, standard deviations, skewness, kurtosis, and key quantiles (1st percentile, 25th percentile, median, 75th percentile, and 99th percentile).
Only two characteristics have a Pearson correlation below 95%: ‘resff3_6_1’ and ‘resff3_12_1’. Their rank correlations and other distributional measures show that, even though the Pearson correlation is low, the distributions align closely; the differences in moments and correlations are driven by outliers.
In the new Python code, the ‘div’ and ‘eqnpo’ characteristics in the market_chars_monthly subroutine are coerced to 0 whenever their magnitude falls below 1e-5, in order to eliminate arithmetic noise.
For a fair comparison of characteristic outputs, we have likewise set ‘div’ and ‘eqnpo’ to 0 in the SAS results whenever their magnitude falls below 1e-5. Note, however, that the portfolio-return comparisons shown at the beginning were made using the previous SAS output, before these zero-floor adjustments were applied.
Main sources of differences:
Numerical precision: Differences in floating-point handling between SAS and Python can alter inequality evaluations and calculated values.
Data revisions: Underlying data may have changed.
Algorithmic adjustments: Differences due to updates in standardized_accounting_data for improved duplicate handling.
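To illustrate the first point, even an elementary sum can land on different sides of an inequality depending on rounding, which is enough to move a stock across a breakpoint:

```python
# Floating-point rounding can flip an inequality between two implementations:
a = 0.1 + 0.2
print(a)         # 0.30000000000000004
print(a <= 0.3)  # False, even though exact arithmetic gives 0.3 <= 0.3
```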